A typical data science project involves wrestling with real-world data and teasing out meaningful information from it. The end result can take the form of charts presented to end-users in dashboards, or of probabilistic models that sift through the noise and make predictions based on underlying trends.
Often, the trend that interests you is hidden amongst noise. Noise here can refer to actual "junk" interference (e.g. line noise in cables, grainy camera footage), or it could be a pretty interesting phenomenon that just so happens to be a hindrance when mixed into your current task ("one man's trash is another man's treasure")!
One pretty useful way to look at new data or problems is from the frequency perspective. This is especially useful when the data has an inherent "ordered" characteristic (time, space, etc.).
This post will not cover the deep math and theories that come with a typical undergraduate EECS course, but will instead go through the intuitions and basic practical uses of such techniques.
We start off with a simple time-series weather dataset, grabbed from here. Off the bat, we can see that there's quite a bit of variation day-to-day, along with a slower-moving trend across the months and years. Which is the signal and which is the noise?
Well, that depends on what you're interested in extracting! For example, if your task is to figure out which is the best month to ... boil an egg outdoors(?), you'll probably be interested in the slower month-to-month trend rather than the daily fluctuations. Conversely, if you want to see whether commuting patterns are influenced by the weather, you'll want to keep an eye on the daily fluctuations.
Either way, you'll be interested in separating the high-frequency (daily) components of the signal from the low-frequency (seasonal) ones, after which you can choose to discard one or the other. Filter design can get quite math-y, so let's start with a simple moving average (SMA) filter instead!
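import numpy as np

window_size = 10

def get_moving_average(signal, window_length):
    # Pad both ends with the edge values so the output keeps the input's length
    padding = int(np.ceil(window_length/2))
    signal = np.pad(signal, padding, "edge")
    # Convolve with a window of ones, then divide by its length to get the average
    smoothed_signal = np.convolve(signal, np.ones(window_length), 'same') / window_length
    return smoothed_signal[padding:-padding]

# `temperature` is assumed to be the 1-D temperature series loaded from the dataset
smoothed_temperature = get_moving_average(temperature, window_size)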
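To make the idea of separating slow and fast components concrete, here is a minimal sketch that splits a series into a low-frequency part (the moving average) and a high-frequency part (whatever the moving average removed). The data is a synthetic stand-in, not the weather dataset, and the names moving_average, synthetic_temperature, low_freq and high_freq, as well as the 30-day window, are assumptions made purely for illustration.

```python
import numpy as np

def moving_average(signal, window_length):
    """Edge-padded simple moving average; output has the same length as the input."""
    padding = int(np.ceil(window_length / 2))
    padded = np.pad(signal, padding, mode="edge")
    smoothed = np.convolve(padded, np.ones(window_length), mode="same") / window_length
    return smoothed[padding:-padding]

# Synthetic stand-in for the weather data (an assumption, not the real dataset):
# a yearly cycle plus random day-to-day jitter.
rng = np.random.default_rng(0)
days = np.arange(3 * 365)
seasonal = 15 + 10 * np.sin(2 * np.pi * days / 365)
daily_noise = rng.normal(scale=3.0, size=days.size)
synthetic_temperature = seasonal + daily_noise

low_freq = moving_average(synthetic_temperature, 30)  # slow seasonal trend
high_freq = synthetic_temperature - low_freq          # what the filter removed
```

Adding low_freq and high_freq back together recovers the original series, so nothing is lost by the split. You simply keep whichever half matters for your task.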
In essence, SMA filters are defined by just one hyperparameter: the window length. Here, we show an illustration of a 5-term SMA filter being applied to the top time series, resulting in the bottom filtered output. At each step, the filter coefficients (in this case all 1/N) are multiplied with the corresponding input signal values (top row), resulting in the intermediate decimal values shown. These are then summed up, resulting in the filtered output values (bottom row). This multiplication and summation chain forms the basis of convolution.
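The multiply-and-sum chain described above can be sketched by hand and compared against np.convolve. The input values below are hypothetical, chosen only for illustration; they are not the values from the illustration.

```python
import numpy as np

# Hypothetical input values, chosen only for illustration.
signal = np.array([20.0, 25.0, 49.0, 30.0, 10.0, 22.0, 28.0])
window_length = 5
coefficients = np.ones(window_length) / window_length  # every coefficient is 1/N

# Slide the window over the input: multiply each value by its coefficient, then sum.
manual_output = np.array([
    np.sum(signal[start:start + window_length] * coefficients)
    for start in range(len(signal) - window_length + 1)
])

# np.convolve in "valid" mode does the same multiply-and-sum at every position
# where the window fully overlaps the input. (Convolution flips the kernel,
# but an all-equal kernel is symmetric, so the flip makes no difference here.)
library_output = np.convolve(signal, coefficients, mode="valid")

print(manual_output)                               # approximately 26.8, 27.2, 27.8
print(np.allclose(manual_output, library_output))  # True
```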
Here, we normalize the terms to 1/N so that the resulting output is simply the average of the 5 input values, hence the name Simple Moving Average. Because the output takes into consideration several neighboring input values, it "filters away" some of the high-frequency components (notice how the extreme values 49 and 10 got pulled to the 20-30s range?) while preserving the overall trend. This makes it one of the simplest low-pass filters around!
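One quick way to see the low-pass behaviour is to feed the filter two pure sine waves, one slow and one fast, and compare how much of each survives. This is a rough sketch; the periods (100 and 4 samples) and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

window_length = 5
kernel = np.ones(window_length) / window_length

t = np.arange(400)
slow_wave = np.sin(2 * np.pi * t / 100)  # period of 100 samples (low frequency)
fast_wave = np.sin(2 * np.pi * t / 4)    # period of 4 samples (high frequency)

# "valid" mode avoids edge effects, so the amplitudes are directly comparable.
slow_out = np.convolve(slow_wave, kernel, mode="valid")
fast_out = np.convolve(fast_wave, kernel, mode="valid")

# Both inputs have amplitude 1, so the peak output is the filter's gain.
slow_gain = np.max(np.abs(slow_out))
fast_gain = np.max(np.abs(fast_out))
print(f"slow wave amplitude after filtering: {slow_gain:.2f}")
print(f"fast wave amplitude after filtering: {fast_gain:.2f}")
```

With these settings the slow wave keeps close to its full amplitude, while the fast wave shrinks to about a fifth of it.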
(Note 1: in the illustration, the output sequence is truncated at the head and tail. This is because our 5-term filter would ideally be matched with 5 valid terms from the input sequence for a proper "averaging" operation. In practice, this can be partially addressed by padding the input signal before the convolution.)
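The truncation and the padding workaround can be seen with a short sketch. It assumes an odd window length and uses hypothetical input values; repeating the edge values is just one of several possible padding strategies.

```python
import numpy as np

signal = np.array([20.0, 25.0, 49.0, 30.0, 10.0, 22.0, 28.0])  # hypothetical values
window_length = 5  # assumed odd, so the window has a well-defined center
kernel = np.ones(window_length) / window_length

# Without padding, only fully overlapping positions are kept:
# len(signal) - window_length + 1 output values.
truncated = np.convolve(signal, kernel, mode="valid")

# Repeating the first and last values (window_length // 2 on each side)
# gives every input sample a full window, restoring the original length.
half = window_length // 2
padded = np.pad(signal, half, mode="edge")
full_length = np.convolve(padded, kernel, mode="valid")

print(len(signal), len(truncated), len(full_length))  # 7 3 7
```

The interior of full_length matches truncated exactly. Only the first and last two values depend on the padding, which is why padding addresses the problem only partially.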